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Abstract 



Logistic regression has been widely apphed in the field of biomedical research for 
^ . a long time. In some applications, covariates of interest have a natural structure, 

such as being a matrix, at the time of collection. The rows and columns of the 
covariate matrix then have certain physical meanings, and they must contain useful 
information regarding the response. If we simply stack the covariate matrix as a 
vector and fit the conventional logistic regression model, relevant information can 
be lost, and the problem of inefficiency will arise. Motivated from these reasons, 
we propose in this paper the matrix variate logistic (MV-logistic) regression model. 
Advantages of MV-logistic regression model include the preservation of the inher- 
ent matrix structure of covariates and the parsimony of parameters needed. In the 
EEG Database Data Set, we successfully extract the structural effects of covariate 
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matrix, and a high classification accuracy is achieved. 

KEY WORDS: Asymptotic theory; Logistic regression; Matrix covariate; Regular- 
ization; Tensor objects. 

1 Introduction 

Logistic regression has been widely applied in the field of biomedical research for a long 
time. It aims to model the logit transformation of conditional probability of an event 
as a linear combination of covariates. After obtaining the data, maximum likelihood 
estimates (MLE) can then be used to conduct subsequent statistical analysis such as 
prediction and interpretation. When the number of covariates excesses the available 
sample size as is usually the case in modern research, the penalized logistic regression 
(Lee and SilvapuUe, 1988; Le Cessie and Van Houwelingen, 1992) is proposed to estimate 
parameters by maximizing the penalized log-likelihood function, and to avoid the problems 
due to high dimensionality. Recently, penalized logistic regression has been applied to the 
filed of biomedical research, such as cancer classification, risk factor selection, and gene 
interaction detection, etc. See Zhu and Hastie (2004) and Park and Hastie (2008) among 
others. We refer the readers to MucuUoch and others (2008) and Hastie and others (2009) 
for details and further extensions of logistic regression. 

In some applications, covariates of interest have a natural matrix structure at the time 
of collection. For example, research interest focuses on predicting the disease status of a 
patient based on his/her image of certain organ. In this situation, we actually obtain the 
data set of the form {{Yi, Xj)}"^^ which are random copies of {Y, X), where y is a binary 
random variable with value 1 indicating diseased and otherwise, and X is a. pxq covariate 
matrix representing the corresponding image of the i-th patient. Another example, the 
Electroencephalography (EEG) Database Data Set which will be analyzed in this article 
later, concerns the relationship between the genetic predisposition and alcoholism. In 
this study, Y = 1 and Y = represent alcoholic and control groups, respectively, and the 
covariate X is a 256 x 64 matrix, where its (i, j)-th element X(jj) is the measurement of 
voltage value at time point i and channel of electrode j. 
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With matrix covariate, we usually stack X column by column as a pg-vector, say 
vec(X), and subsequent statistical analysis proceeds in the usual way. Specifically, con- 
ventional logistic regression model admits that each j) possesses its own effect ^ij on 
the response Y. Motivated from the matrix structure of X, it is natural to store ^i/s 
into a. p X q parameter matrix ^. Then, conventional logistic regression model takes the 
formulation 



where 7 is the intercept term. It is obvious that model does not consider the inherent 
matrix structure of X, as the corresponding parameter matrix ^ enters model ([1]) only 
through vec(^) while its matrix structure is ignored. As X is a matrix at the time of 
collection, it is reasonable for ^ to possess certain structure. For example, those X(j ,,)'s 
in a common region or in the same column (row) of X may have similar effects on the 
response. Directly fitting ([1]) without considering the structural relationships among ^j^'s 
will cause the problem of inefficiency. Moreover, by ignoring the inherent matrix structure, 
one can hardly identify the row and column effects from their joint effect matrix ^, when 
the structural effects of X are the interest of the study. Another problem of model ([1]) 
is that the number of parameters usually becomes extremely large in comparison with 
the sample size. For instance, we have 1 + 64 x 256 = 16385 parameters for the EEG 
Database Data Set when fitting model ([1]), while the available sample size is 122 only. As 
high dimensionality makes statistical inference procedure unstable, it becomes urgent to 
seek a more efficient method in dealing with matrix covariate. The main theme of this 
research is thus to overcome the above mentioned problems, via incorporating the hidden 
structural information of X into statistical modeling. 

The rest of this article is organized as follows. Section 2 introduces the matrix variate 
logistic regression model. Its statistical meaning is also discussed. Section 3 deals with 
the asymptotic properties of our estimators and the implementation algorithm. The 
proposed method is further evaluated through two simulation studies in Section 4 and 
the EEG Database Data Set in Section 5. Section 6 illustrates the extension of matrix 
variate logistic regression model to the case of multiple classes. The paper is ended with 



logit{P(F = l|X)} = 7 + ^e.,X(M) 



= 7 + vec(0 vec(X), 
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conclusions in Section 7. 

2 Matrix Variate Logistic Regression Model 
2.1 Model specification 

To incorporate the structural information into modeling, as motivated from the matrix 
structure of X, we propose the matrix variate logistic (MV-logistic) regression model 

logit{P(F = l|X)} = 7 + a^X/3, (2) 

where a = (ai,--- ,ap)'^ and f3 = , (3q)'^ are the row and column parameter 

vectors, respectively, and 7 is the intercept term. As the matrix structure of X is preserved 
in model ([2]), these parameters have their own physical meanings. In the EEG Database 
Data Set, for instance, a is interpreted as the effect of different time points, and /3 as 
the effect of different channels. By fitting MV-logistic regression model, we are able to 
extract the column (row) information of X which will provide further insights into the 
relationship between Y and X. Note that under model ([2]), can only be identified 

up to a scale, since {c~^a, c/3) will result in the same model for any constant c 7^ 0. For 
the sake of identifiability, without loss of generality, we assume ai = 1 in our derivation 
(see Remark 12.11 for further discussion). Denote the rest of parameters in a by a*, i.e., 
a = (1, a*^)"^, and 9 = (7, a*'^, P'^y are the parameters of interest. We thus have p + q 
free parameters contained in 9, while it is + 1 in the conventional logistic regression 
model (pP). One can see that a merit of model is the parsimony of parameters used. 
Thus, when model ([2]) is correctly specified, an efficiency gain in is expected. 

Adoption of model ([2]) is equivalent to modeling the covariate-specific odds ratio Rij 
of X(i,j) while keeping the rest covariates fixed as 

log(i?,,) = ^ R,, = {exp(/3,■)}"^ (3) 

Thus, the effect of X(ij) depends on the product instead of or (5j solely. A positive 
aiPj implies Rij > 1. Note that since we set ai = 1, if Pj > (< 0), then > 1 (< 1) 
indicates Rij > Rij. Take the EEG Database Data Set to exemplify again, relation ([3]) 
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implies that each channel has its own baseline odds ratio exp(/3j). Depending on the time 
point of being measured, say i, the odds ratio is further modified by taking a power of ai. 

Remark 2.1. Although at the population level there is no difference for which ai is set 
to one, it is crucial to practical implementation. If ai is near zero, setting ai = 1 will 
lead to unstable results. Here we provide an easy guidance to select the baseline effect. 
Let pij be the sample correlation coefficient between X(j j) and Y . We then set ai = 1 if 
i = argmax;j{^^^-^ \pkj\ '■ k = 1, ■ ■ ■ ,p} . The intuition is to choose the one as the baseline 
which is the most likely to be correlated with the response. We find in our numerical 
studies this simple approach performs well. 

Remark 2.2. Standardization of covariate will not affect the final result of logistic regres- 
sion model except a change of scale in parameter estimates, since the standard deviation 
of each covariate can be absorbed into the corresponding parameter. In MV-logistic re- 
gression model, however, we have pq covariates but only p + q free parameters and, hence, 
it is generally impossible to absorb those standard deviations into fewer parameters. In 
summary, standardization of covariates will result in a different MV-logistic regression 
model. We will further discuss this issue in Section 5. 

2.2 Statistical meaning of MV-logistic regression model 

This subsection devotes to investigating the statistical meaning of MV-logistic regression 
model. We will assume the validity of the general model ([T]) with (70, ^0) being the true 
value of (7,0- O'^^ will see that fitting MV-logistic regression model actually aims to 
estimate the best rank-1 approximation of the true parameter matrix ,^0; in the sense of 
minimum KuUback-Leibler divergence (KL-divergence), or equivalently, maximum like- 
lihood. This observation supports the applicability of MV-logistic regression model in 
practice. 

First observe that, as = vec(a/3"^)"^vec(X), MV-logistic regression model ([2]) is 

equivalent to the conventional model ([T]) with the constraint ^ = Thus, MV-logistic 
regression utilizes the matrix structure of and approximates it by a rank-1 matrix a[3'^ in 
model fitting. Secondly, it is known that maximizing the likelihood function is equivalent 
to minimizing the KL-divergence (Bickel and Doksum, 2001). Let /(?/|X;7,,^) be the 
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conditional distribution function of Y given X under model ([I]). The KL-divergence 
between (70, ■Co) and any (7,^ is defined as 

i^L(7o,eo||7,0 = ^x{y" (log^^^^;^^/(y|X;7o,eoMl/}, (4) 

where Ex{-) takes expectation with respect to X. Combining the above two facts, at 
the population level, fitting MV-logistic regression to estimate (7, a, (3) is equivalent to 
searching the minimizer of the minimization problem 

mill 1-^^(70, ^o||7,«/3^)- (5) 

Let (ao, /So) be the minimizer of (a, /3) in ([S]). MV-logistic regression then aims to search 
the matrix ao/3o^, which is called the best rank-1 approximation of the true parameter 
matrix ^o- Notice that the optimality between ^0 and ao/3(f here is not measured by the 
usual Frobenius norm, but the KL-divergence in ([5]) instead (see Remark 12.31 for more 
explications). Therefore, when ,^0 = c^o/^lf, MV-logistic regression must be more efficient 
than the conventional approach to estimating ,^0 by the usual MLE argument. Even if ^0 
does not equal ao/3(f , we are still in favor of MV-logistic regression, especially when the 
sample size is relatively small. In fact, there is a trade-off between "correctness of model 
specification" and "efficiency of estimation". With limited sample size available, instead 
of estimating the full parameter matrix ^0 which will suffer from the problem of unstable 
estimation, MV-logistic regression model aims to more efficiently estimate the best rank-1 
approximation of ^o- 

Clearly, the performance of MV-logistic regression model relies on "how well the true 
parameter matrix ^0 can be approximated by a rank-1 matrix ao/^if"- As demonstrated 
in Section 5 that MV-logistic regression outperforms the conventional approach in the 
EEC Database Data Set, we believe this condition for ^0 is not restrictive based on the 
following two reasons. First, it is reasonable to assume that most covariates have effect 
sizes near zero in modern biomedical research and, hence, ^0 is plausible to be a low-rank 
matrix. The second reason is based on the characteristic of the underlying study. In the 
EEC study, for instance, measurements at the same time points (channels) would have 
similar effects. These facts reflect the potential of .^0 to be well explained by low rank 
approximation and, hence, the validity of MV-logistic regression model. The robustness 
of MV-logistic regression model will be further studied in Section 4.2. 

6 



Remark 2.3. Given the matrix ^q, the weighted rank-1 approximation problem of 
with the weight matrix W (Lu and others, 1997; Manton and others, 2003) is to find 
the minimizer (ao,/3o) of the minimization problem minQ,^^ ||^o — where || ■ ||^ — 

vec(-)-^W^vec(-). The resulting matrix a^P^ is called the best rank-1 approximation of 
in the sense of minimum \\ ■ \\w norm. When W = Ipg, the || ■ norm reduces to the 
Frobenius norm, and ao/^Q can be obtained by singular value decomposition (SVD) of ^q. 
Paralleling to this idea, fitting MV-logistic regression model (with ^ being the estimation 
criterion) can be treated as finding the best rank-1 approximation of the sense of 

minimum KL- divergence 

3 Statistical Inference Procedure 

Some notations are defined here for ease of reference. For any function g, g^^^ represents 
the /c-th derivative of g with respect to its argument. The parameters of interest are 
6 = (7, f3^f. Denote P{Y, = l\Xi) by tt, = 7i{9\Xi) = exp(7 + a^Xif3){l + exp(7 + 
a^Xif3)}^^ . Define n(^) = (vri, ■ ■ ■ ,7in)^ and V{9) = diag(wi,--- ,Vn), where Vi = 
7rj(l — TTj) is the conditional variance of Yi given Xj. Let Y = (Yi, ■ ■ ■ , and X(^) = 
[ Xi(^),--- ,X„(^) f be an n X pq matrix, where Xi(^) = (1, C, a^X^)^, C = 
da/da* = [Op_i, Ip_i]"^, Oq is the a-vector of zeros, and is the a x a identity matrix. 
One can treat Xj(6') as the working covariates of the z-th subject, which will be used in 
the development of our method. Note that C depends on which element of a is set to 
one. As «! = 1 throughout the paper, the notation C is used for simplicity. 

3.1 Estimation and Implementation 

The estimation of 6 mainly relies on maximum likelihood method. Given the data set 
{(li, the log-likelihood function of 9 is derived to be 

n 

m = ^ r,(7 + a^X^P) - log{l + exp(7 + a^X,(3)}, (6) 

and 6 can be estimated by the maximizer of £{6). 

In modern research of biostatistics, however, an important issue is the number of 
covariates could be large in comparison with the sample size. This will make the estimation 

7 



procedure unstable, or traditional methodologies may even fail. As mentioned in the 
previous section, the number of free parameters in model ([2]) is p + q, while it is + 1 in 
the conventional logistic regression model. MV-logistic regression then suffers less severity 
from high dimensionality. However, this can not entirely avoid the problem of instability. 
As we have seen in the EEG Database Data Set, the covariate X is a 256 x 64 matrix (i.e., 
320 parameters in MV-logistic regression model), while there are only 122 observations. 
To overcome the difficulty of high dimensionality, Le Cessie and Van Houwelingen (1992) 
proposed the penalized logistic regression method. This motivates us to further consider 
the penalized MV-logistic regression. Let J{6) > be a twice continuously differentiable 
penalty function of 6, and A > be the regularization parameter. We propose to estimate 
^by 

= argmax4(6l), (7) 
e 

where i\{0) = £{6) — XJ{6) is the penalized log likelihood. There are many choices 
of J(-) depending on different research purposes, wherein J{9) = 116*11^/2 and J{9) = 
(II a* IP + ||/3|p)/2 are the most widely applied ones. The difference between them is 
whether to put the penalty on the intercept term 7 or not. The regularization parameter 
A should also be determined in practice. A commonly used approach is to select A through 
maximizing the cross-validated classification accuracy. Other selection criteria can be 
found in Le Cessie and Van Houwelingen (1992). 

Interpretations of the elements of 9x have been introduced in Section 2.1. Let 9x = 
(7, a*'^, and a = (1, d*^)-^. The odds ratio Rij in ([3]) is estimated by exp{aif3j). We 
would also be interested in the success probability 7r(6'|x) for any given p x q matrix x, 
which can be estimated by 'k{9x\x). A discrimination rule is then to classify a subject 
with matrix covariate x to the "diseased" group if 7r(6'A|a;) > 0.5 and the "non-diseased" 
group otherwise. Detailed inference procedures about 9 and 7r(6'|a;) will be discussed in 
the next subsection. 

We close this subsection by introducing the iterative Newton method to obtain 9x- 
The gradient of ix{9) (with respect to 9) is calculated to be 

i^^\9) = e^'\9)-XJ^'\9), (8) 
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where i^^\9) = X(6')^{Y — n(^)}. Moreover, the Hessian matrix oi£x{9) is derived to be 





EtlC^My^-^^ 

EtiXrC{Y,-n,) 



(9) 



where 



Hx{9) = X{9fV{9)X{9) + XJ^^\9). 



(10) 



As suggested by Green (1984), we will ignore the last term of (jH]) since its expectation is 
zero. Finally, 9x can be obtained through iterating 



= + {Hx{9(k))y'i'x\9^k)), k = 0,1,2, 



:iii 



until there is no significant difference between 6'(fc+i) and 6'(fc), and outputs 9^ = 9<^k+i)- A 
zero initial 6'(o) is suggested and performs well in our numerical studies. 

3.2 Asymptotic Properties 

Asymptotic properties of 9\ can be derived through usual arguments of MLE. The result 
is summarized in Theorem 13.11 below. 

Theorem 3.1. Assume the validity of model ^ and the regularity conditions of the 
likelihood function. Assume also the information matrix I{9) = £'{Xj(^)t;jXj(^)^} is 
nonsingular. Then, for any fixed \, \/n{9\ — 9) converges weakly to A^(0, S(6')), where 



From Theorem 13. H 9x is shown to be a consistent estimator of 9. It also enables us 
to construct a confidence region of 9, provided we have an estimate of S(^). This can 
be done by the usual empirical estimator with the unknown 9 being replaced by 9\. In 
particular, define 



n 



Hx{9) 



-X{9fV{9)X{9)\ (-H,{9) 
n \n 



(12) 



We propose to estimate the asymptotic covariance matrix S(^) by T,{9\). For any < 
a < 1, an approximate 100(1 — a)% confidence interval of 9i, the i-th element of 9, is 
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constructed to be 

(o . P(^^a)]. , , ^ F(^a)]A .... 

- + y (13) 

where 2a/2 is the 1 — a/2 quantile of standard normal and [E(0A)]j denotes the i-th diagonal 
element of S(^a)- As to making inference about 7r(6'|x) for any p x g matrix x, by applying 
delta method and the result of Theorem 13.11 we have 

where a\[Q\x) = x{6)'^'E{6)x{6) and x{6) = {1^ 0^x^C^a^x)'^ . The asymptotic variance 
a1{6\x) can be estimated by a1{6x\x), where a1{6\x) = x{9)^T,{9)x{9). An approximate 
100(1 — a)% confidence interval of 7r(6'|x) is then constructed to be 

/ exp(7 + d^x/3-Z|^^^) exp(7 + d^z/3 + 2|^^^) 

yi + exp(7 + «^x/3-2|^^^) ' l + exp(7 + «^x/3 + ^|^^^) 

which is guaranteed to be a subinterval of [0, 1]. 

4 Simulation Studies 

The proposed method is evaluated through simulation under two different settings. Let 
the parameters be set as 7 = 1, a = (1, 0.5, — O.Sl^.g)^, and (3 = (1, 0.5, 1, — l^^g)-^, 
where la is the a- vector of ones, and the penalty function J{6) = (||«*||^ + ||/3|p)/2 is 
considered. The tuning parameters for different logistic regression models are determined 
from an independent simulation by maximizing the classification accuracy. Simulation 
results are reported with 1000 replicates. 

4.1 Simulation under MV- logistic regression model 

The first simulation study evaluates the proposed method under model with (p, q) = 
(12, 10). We first generate X such that vec(X) follows a pg-variate normal distribu- 
tion with mean zero and covariance matrix Ipg. Conditional on X, F is generated from 
model ([2]) with the specified 6. The averages of 6x and standard errors from the diagonal 
elements of S(^a) in (|T2l) are provided in Table [1] for n = 150,300. For the case of small 
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sample size n = 150, biases for 6x are detected. The biases are mainly due to the penalty 
term XJ{6), as is the case of ridge regression. In fact, from the proof of Theorem 13. H the 
bias term is derived to be Xn^^{I{9)J^^''{9)} which is of order n~^. The biases, however, 
are all relatively small in comparison with the the corresponding standard deviations of 6x, 
and decrease as the sample size increases. Moreover, all the standard deviations are well 
estimated by the diagonal elements of S(6'a). These observations validate Theorem 13.11 
and the proposed empirical variance estimator. 

As 7 + a'^XP = 7 + ® a^)\/ec{X) is the critical component used in prediction, 
we also report the averages of similarities n'^t'/dlnjl ||f ||) with u = ® a^)^ and 

V = (7, /3-^ ® a'^)'^ in the last row of Table [TJ Although biases arise especially for the 
case oi n = 150, the similarities are not affected and have values very close to one in both 
cases. This means MV-logistic regression would have good performance in classification, 
since it is the direction of (7, f3'^ ® a'^)'^ that is relevant to classification. To demonstrate 
this, in each simulation replicate we also generate another independent data set, and 
calculate the classification accuracy by using 7 + cP^Xp. For comparison, we also fit the 
conventional logistic regression model ([1]) to obtain estimates of (7, C.) as if we ignore 
the relationship ^ = aP'^, and denote it by (7,0- The classification accuracy obtained 
from 7 + vec(.^)"^vec(X) is also calculated. The averaged classification accuracies and the 
winning proportions (the proportion of MV-logistic regression with higher classification 
accuracy over 1000 replicates) under n = 150 are placed in Table |2] (the row indexed by 
cr = 0). It is detected that MV-logistic regression produces more accurate results, and 
the superiority of MV-logistic regression becomes more obvious for larger (p, q) values. 

4.2 Simulation violating MV-logistic regression model 

In the previous numerical study, MV-logistic regression outperforms conventional logistic 
regression when data is generated from model ([2]). It is our purpose here to evaluate 
the performance of MV-logistic regression, when the underlying distribution departures 
from model ([2]). The same setting in Section 4.1 is used, except for each simulation 
replicate, Y is generated from the conventional logistic regression model ([1]) with 7 = 1 
and ^ = + 5, where each element of the px q matrix 6 is also randomly drawn from a 
normal distribution with mean zero and variance o"^. With an extra term 6, model ([2]) is 
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violated, and the magnitude of violation is controlled by a. Both models ([I]) and ([2]) are 
fitted to obtain the estimates (7,0 (7,0 with ^ = af3'^, respectively. We compare 
the classification accuracies of the predictors 7 + d^X/3 and 7 + vec(0"^vec(X) applied on 
another independent data set, under different combinations of [p, q) and a = 0.1, 0.3, 0.5. 
Note that cr = means MV-logistic regression is the true model. To see how these a values 
affect the deviation from MV-logistic regression model, we also report the corresponding 
explained proportions p^- defined below. Let k = p A q and si > ■ • ■ > > be the k 
non-zero singular values of each generated ^. Then is the average (over 1000 replicates) 
of si/{Yl'£=i ^^)- Here < po- < 1 measures how well ^ can be explained by its best rank- 
1 approximation. Small value of indicates severe violation of MV-logistic regression 
model. 

The analysis results with n = 150 are placed in Table [2l For every a, conventional 
logistic regression is the correct model, and its classification accuracies are irrelevant to a 
but will decay rapidly as pq increases. On the other hand, MV-logistic regression is less 
affected by {p, q) than conventional approach as it involves only p + q parameters. The 
cost for its parsimony of parameters is that, for any fixed (p, g), its performance becomes 
worse for larger a. Overall, for moderate deviations a < 0.3, MV-logistic regression 
outperforms the conventional approach for every combination of (p, q). For a = 0.5, MV- 
logistic regression has lower classification accuracy at {p,q) = (12,10). We note that in 
this case, p^ = 33% implies a large deviation from MV-logistic regression model, while 
conventional logistic regression is expected to have better performance as the number of 
parameters is 121 < n = 150. For larger {p, q) values, however, MV-logistic regression 
will be the winner even in the most extreme case of (p, q) = (20, 30), where the explained 
proportion p^- is 22% only. It indicates that when p and q are large, the gain in efficiency 
(from fitting MV-logistic regression model) more easily exceeds the loss due to model 
misspecification. These observations show that MV-logistic regression model has certain 
robustness against the violation of model specification, especially for large number of 
covariates. 
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5 The EEG Database Data Set 



In this section, we analyze the EEG Database Data Set to demonstrate the usefulness of 
MV-logistic regression. The data set consists of 122 subjects, wherein 77 of them belong 
to the group of alcoholism [Yi = 1) and the rest 45 subjects are in the control group {Yi = 
0). Each subject completed a total of 120 trials under three different conditions (single 
stimulus, two matched stimuli, and two unmatched stimuli). In each trial, measurements 
from 64 electrodes placed on subject's scalp at 256 time points are collected, which results 
in a 256 x 64 covariate matrix. It is interested to distinguish two types of subjects based on 
the collected matrix covariates. The data set can be downloaded from the web site of UCI 
Machine Learning Repository {http://archive.ics.uci. edu/ml/datasets/EEG+Database) . 

The EEG Database Data Set was recently analyzed by Li and others (2010) whose 
main purpose focused on dimension reduction. Here we adopt a similar strategy for data 
preprocessing. In particular, we consider partial data set of single-stimulus experimenters 
only, and the averaged matrix covariates over different trials of the same subject (denoted 
by X*) will be considered in our analysis. This data setting is same with the one used in Li 
and others (2010). Note that with 256 x 64 covariate matrix, we have 320 free parameters 
which is still an excess of the sample size 122. Before fitting MV-logistic regression model, 
the generalized low rank approximations (GLRAM) of Ye (2005) is performed to reduce 
the dimensionality of X* first. GLRAM is an extension of principal component analysis 
to matrix objects, which aims to find orthogonal bases A G Rp^p" and B G M'^^'?" with 
Po < P and Qo < Q such that X* is well explained by the lower dimensional transformation 
A^X*B. Detailed analysis procedure is listed below. 

1. Apply GLRAM to find A and B under (po, go) = (15, 15). Define X* = A^X*B. 

2. Standardize each element of X* to obtain the covariate matrix Xj. 

3. Fit MV-logistic regression with {{Y^, Xi)}}^^. We apply the rule suggested in Re- 
mark to set as = 1 and denote the rest a^'s by a*. 

To estimate 6 in Step 3, we adopt the penalty function J{6) = 116*11^/2. The penalty 
A = 24 is chosen so that the leave-one-out classification accuracy is maximized. The 
resulting estimates of a and /3 are provided in Figured] (a)-(b) with the corresponding 
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95% confidence intervals constructed from (fT3ll . As many estimates of a^'s and /3j's are 
significantly different from zero, channels and measurement times surely play important 
roles in distinguishing alcoholic and control groups. Observe that all the estimates of a* 
are smaller than 1. Thus, for those channels with [5j > 0, the effects of measurement 
times are all smaller than the third time point. In other words, most channels achieve 
the largest odds ratio at the third time point. For the rest channels with Pj < 0, the 
largest effect happens at the 14-th time point. Moreover, among all combinations of time 
points and channels, X(3 §) has the largest odds ratio (since a^^s is the largest among 
all dj/3j) and would be critical in classifying alcoholic and control groups. Figured] (c) 
provides the predicted probability of being alcoholism for every subject (by using the rest 
121 subjects) as well as the 95% confidence interval from f[T5]) . and Figured] (d) gives 
the kernel density estimates of a'^Xif3 for two types of subjects. An obvious separation 
of two groups is detected which demonstrates the usefulness of MV-logistic regression in 
classification. 

The choice of {po, go) = (15, 15) in Step 1 is the same with the data preprocessing 
step of Li and others (2010). Under this choice, we correctly classify 105 of 122 sub- 
jects (through leave-one-out classification procedure) by fitting MV-logistic regression, 
while the best result of Li and others (2010) from dimension folding (a dimension reduc- 
tion technique that preserves the matrix structure of covariates) followed by quadratic 
discriminant analysis gives 97. We believe the reasons of a better performance for MV- 
logistic regression are twofold. First, we adopt a different data preprocessing technique 
GLRAM in Step 1, where Li and others (2010) use a version of (2D)^PCA (Zhang and 
Zhou, 2005). Hung and others (2011) show that GLRAM is asymptotically more efficient 
than (2D)^PCA in extracting bases, and hence it is reasonable for GLRAM to produce 
a better result. Second, standardization in Step 2 makes the EEG Database Data Set 
more suitable to fit MV-logistic regression model. Without standardization, MV-logistic 
regression cannot produce such a high classification accuracy. This also reflects that stan- 
dardization is an important issue before fitting MV-logistic regression model. We remind 
the readers again that standardization of covariates will result in a different MV-logistic 
regression model. 

For comparison, the po9o extracted covariates in Step 1 are also fitted with the con- 
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ventional (penalized) logistic regression of Le Cessie and Van Houwelingen (1992), where 
the tuning parameter is selected in the same way such that the leave-one-out classifi- 
cation accuracy is maximized. Table [3] provides the analysis results of both methods 
under different choices of {po,qo)- One can see that MV-logistic regression uniformly 
outperforms conventional logistic regression. Moreover, the classification accuracy of the 
conventional approach decays rapidly as the numbers of po and go increase, while those 
of MV-logistic regression roughly remain constant. As mentioned previously, MV-logistic 
regression requires fewer parameters in model fitting, possesses certain robustness against 
model violation and, hence, an efficiency gain is reasonably expected. 

Remark 5.1. As suggested by one referee, we also compare our approach to the widely 
used procedure, principal component analysis (PC A) followed by logistic regression. In 
particular, PC A is applied to the vectorized covariates vec(X*). The leading r principal 
components are then used to fit the conventional (penalized) logistic regression model, 
where the tuning parameter is also selected to maximize the leave-one-out classification 
accuracy (see Table \^ for the results). It can be seen this widely applied approach can 
not produce classification accuracy higher than 0.820, while the best result of MV-logistic 
regression is 0.861. This reveals the limitation of the conventional yec{X)-based approach 
which usually produces a large number of parameters. 

6 Extension to Multi-Class Response 

We illustrate the extension of MV-logistic regression to the case of H classes with H > 2. 
Let G {1, ■ ■ ■ , H} be the random variable indicating which class the i-th subject belongs 
to. We can equivalently code Yi as Yi = (Yu, ■ ■ ■ , Yn-i^i)'^, where Yhi = 1 indicates the 
i-th subject belongs to category h, h = 1, ■ ■ ■ , H — 1, and {Yhi = 0, h = 1, ■ ■ ■ , H — 1} 
means the i-th subject belongs to category H. Consider the model 



with ah = (l,a^"^)"^ for the sake of identifiability as before. The parameter of interest is 
6 = {9j, ■ ■ ■ , Ojj^iY with 6h = (7/1, f^h)'^- This gives the covariate-specific probability 



P{Yi = h\Xi) to be Tihi = 7r/,(0|Xi) = exp(7/, + alXi(3h){l + Y.h=i exp(7h + "h^A)} ^ 




(16) 



15 



1=1 



foi h = !,■■■ ,H-l, and irm = 7rH{0\Xi) = {1 + Y.h=i exp(7h, + alXif3h)}~^ for h = H. 
Based on the data {(Fj, Xj)}JL^, the fog-hkehhood function of 6 is derived to be 

'H-l H-l 

^hiilh + alX.Ph) - log{l + e^vilh + oilXiPh)} ■ (17) 

_h=l h=l 

Then, 6 is proposed to be estimated by Ox = argmaxg i*x{0), where i*xW — ^*(^) — AJ(0). 

To apply the Newton method to obtain Ox, we need to calculate the gradient vector and 
Hessian matrix of £1(0). Let X*{0) = diag{X(^i), ■ ■ ■ ,X(^h-i)}, where X(-) is defined 
in the beginning of Section [31 Let also Yh = {Yhi, ■ ■ ■ , V^n)^, ^h{0) = {iThi, ■ ■ ■ , i^hn)'^ 
foTh=l,---,H-l,Y* = (Yf , ■ ■ ■ , Yj,_,f, and U*{0) = {Ur{Of, Un-iiOfV. 
Then, the gradient is derived to be fx^^\0) = X*{Of{Y* - U*{0)} - XJ^^\0), and 
the Hessian matrix (after ignoring the zero expectation term as in fllOl) ) is given by 
HliO) = X*{OfV*{0)X*{0) + AJ(2)(0), where V*{0) = [V,,{0)], Vhh = diag{7rM(l - 
TTh.i), ■ ■ ■ ,7r/i„(l -7r/,„,)}, /i = 1, ■ ■ ■ 1, and Vhk = Vkh = diag(-7rh,i7rfci, ■ ■ ■ , -nhnT^kn), 

I <h ^ k < H - I. Finally, is obtained by replacing P and ^ with t^^\0) and 
Hl{0) in the iteration (fTTj). 

By a similar argument of Theorem 13 .11 we can also deduce that ^Jn{Ox — 0) converges 
weakly to a normal distribution with mean zero and covariance matrix S*(0) = {/*(^)}~^, 
where I*{0) = E[n-^X*{OfV*{0)X*{0)]. Moreover, S*(6>) can be estimated by S*(^a), 
where 

S*(6») = (^^HliO)^ (^h<.*{0)^V*{0)X*{0)^ (^^HliO)^ . (18) 

Inference procedures are the same with what we have already established in the previous 
sections. 



7 Conclusions 

Besides the EEG example, tensor objects are frequently encountered in many applications, 
such as mammography images, ultrasound images, and magnetic resonance imaging (MRI) 
images, which are examples of order-two tensors. One can also imagine a subject with p 
covariates measured on q time points under t different treatments is naturally stored as 
an order-three tensor. In this paper we propose MV-logistic regression model, when the 
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covariates have a natural matrix structure (order-two tensor). Its performance is validated 
through the EEG Database Data Set, where in the best case of {po, go) = (15, 15) we 
successfully classify 105 of 122 subjects to the right group. The superiority of MV-logistic 
regression comes from the fact that it aims to estimate the best rank-1 approximation 
of the true parameter matrix and the parsimony of parameters used. It is also found 
in our simulation studies that MV-logistic regression has certain robustness against the 
violation of model specification. Thus, MV-logistic regression model can be used as a 
good "working" model, especially when the sample size is relatively small in comparison 
with the number of covariates. Although we focus on matrix covariate in this article, the 
proposed method can be straightforwardly extended to tensor objects of higher order. 

As discussed in Section 2.2, MV-logistic regression model aims to approximate the 
true parameter matrix in the sense of minimum KL-divergence. Here the discussion of 
MV-logistic regression model is only focused on the case of rank-1 approximation due 
to its simplicity in implementation and theoretical development, and we find this simple 
rank-1 model is suitable for the EEG Database Data Set. Of course we may encounter 
situations when rank-1 approximation is not adequate. It is therefore natural to extend 
this rank-1 structure to a more general rank-r setting. In particular, for any positive 
integer r < p A q, consider the model 

logit{P(F = 1|X)} = 7 + vec(A5^)^vec(X), (19) 

where A G W^^^' and B G M^^^' are the row and column parameter matrices, respectively. 
Note that model (fT9|) can be treated as the conventional logistic regression model ([1]) with 
the constraint C, = AB^ , and will reduce to our MV-logistic regression model ([2]) when 
r = 1. For the general rank-r setting, however, some issues need to be further studied, such 
as the problem of identifiability of parameters in [A, B) and the corresponding asymptotic 
properties. Moreover, the algorithm developed in this paper can not be directly applied, 
and also need to be revised to adapt to model (HM . We believe model (ITI?]) has its wide 
applicability in practice, and is of great interest to investigate in a future study. 
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Table 1: Averages of 9\ (Mean), averages of diagonal elements of S(^a) (SE), and standard 
deviations of 9x (SD) under model ([2]) with (p, q) = (12, 10). The last row gives averages 
(standard deviations) of similarities (SIM) between (7,/?"^ ® a'^)^ and (7, f3'^ ® a'^)'^ 









n = 150 






n = 300 








True 


Mean 


SD 


SE 


Mean 


SD 


SE 


7 


1.000 


1.095 


0.447 


0.460 


1.055 


0.289 


0.287 


a* 


0.500 


0.549 


0.165 


0.157 


0.525 


0.098 


0. 


,096 




-0.500 


-0.543 


0.169 


0.156 


-0.524 


0.103 


0, 


,096 




-0.500 


-0.558 


0.162 


0.158 


-0.525 


0.100 


0. 


,095 




-0.500 


-0.557 


0.170 


0.157 


-0.528 


0.095 


0, 


,095 




-0.500 


-0.560 


0.168 


0.157 


-0.524 


0.101 


0, 


,096 




-0.500 


-0.550 


0.165 


0.156 


-0.524 


0.094 


0, 


,095 




-0.500 


-0.557 


0.168 


0.157 


-0.523 


0.096 


0. 


,095 




-0.500 


-0.546 


0.161 


0.157 


-0.525 


0.098 


0, 


,096 




-0.500 


-0.557 


0.164 


0.157 


-0.527 


0.097 


0. 


,096 
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-0.555 


0.163 


0.157 


-0.521 


0.097 


0, 


,095 




-0.500 


-0.554 
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-0.526 
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0, 
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/3 


1.000 


0.972 


0.226 


0.237 


0.995 


0.174 


0, 


,177 




0.500 


0.485 


0.206 
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0.506 
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0, 
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0.170 


0. 


,178 
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-1.008 


0.171 


0, 
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-0.974 
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0.172 


0, 
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-0.969 
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-0.996 


0.172 


0, 


.177 




-1.000 


-0.962 


0.215 
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-1.008 


0.169 


0, 


.178 




-1.000 


-0.968 


0.223 
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-0.996 
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0, 
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-1.000 


-0.966 


0.220 


0.238 


-1.012 


0.171 


0, 


.179 




-1.000 


-0.959 


0.216 


0.237 


-1.000 


0.172 


0, 


.178 


SIM 




0.950 


(0.021) 




0.981 


(0.007) 
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Table 2: Averages of classification accuracies (CA) of MV-logistic/Logistic regression and 
winning proportion (WP) of MV-logistic regression for different values of a, p, q and 
n = 150. Per is the average of the explained proportion of the best rank-1 approximation. 







(12,10) 




(20, 15) 




(20,30) 


a 


Pa 


CA (WP) 


Pa 


CA (WP) 


Pa 


CA (WP) 





100% 


0.867/0.739 (1.00) 


100% 


0.866/0.670 (1.00) 


100% 


0.839/0.622 (1.00) 


0.1 


69% 


0.856/0.741 (1.00) 


63% 


0.848/0.668 (1.00) 


58% 


0.821/0.622 (1.00) 


0.3 


44% 


0.794/0.745 (0.87) 


36% 


0.765/0.669 (0.94) 


32% 


0.720/0.623 (0.94) 


0.5 


33% 


0.727/0.751 (0.37) 


26% 


0.677/0.673 (0.60) 


22% 


0.631/0.621 (0.61) 



Table 3: The leave-one-out classification accuracy of MV-logistic/Logistic regression for 
the EEC Database Data Set under different combinations of {po, go) • 



Po 






Qo 






15 




20 




30 


15 


0.861/0.803 


0.: 


836/0.795 


0.! 


844/0.787 


30 


0.844/0.779 


0.: 


828/0.779 


0.! 


828/0.754 


60 


0.844/0.737 


0.: 


853/0.713 


0.! 


828/0.713 



Table 4: The leave-one-out classification accuracy of PCA followed by conventional logistic 
regression for the EEC Database Data Set under different choices of r . 



r 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


Accuracy 


0.631 


0.648 


0.730 


0.820 


0.820 


0.803 


0.787 


0.779 


0.787 


0.795 


r 


20 


30 


40 


50 


60 


70 


80 


90 


100 


120 


Accuracy 


0.746 


0.779 


0.795 


0.754 


0.738 


0.746 


0.713 


0.705 


0.631 


0.492 
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Figure 1: Analysis results of the EEG Database Data Set with (po, Qo) = (15, 15). (a)-(b) 
Estimates of a and /3 (the circles) with the 95% confidence intervals (the vertical bars). 
As we set = 1, no confidence interval is provided for a^. (c) Estimates of TT{6\Xi) (the 
symbol *) with 95% confidence intervals (the horizontal lines). Subjects 1-45 and 46-122 
belong to the control and alcoholic groups, respectively. The vertical dash line indicates a 
probability of value 0.5. (d) Kernel density estimates of a'^Xifi for control and alcoholic 
groups. 
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